IO: Fix parquet read from s3 directory #33632


Merged
merged 27 commits into pandas-dev:master, Apr 26, 2020

Conversation

@alimcmaster1 (Member) commented Apr 18, 2020

(Seems to have also fixed the xfailing test in #33077)

NOTE: let's merge #33645 first, since that fixes up a crucial bit of error handling around this functionality.

@alimcmaster1 alimcmaster1 added the IO Parquet parquet, feather label Apr 18, 2020
@jorisvandenbossche (Member) left a comment

Thanks for looking into this!

path.close()

parquet_ds = self.api.parquet.ParquetDataset(
path, filesystem=get_fs_for_path(path), **kwargs
Member

Is this filesystem=get_fs_for_path(path) needed? What happens if you just pass the path? (which I assume has e.g. an s3://... in it?)

Member Author

pyarrow seems to only allow a file path, as opposed to a dir path. Removing the filesystem arg here throws:

            for path in path_or_paths:
                if not fs.isfile(path):
                    raise IOError('Passed non-file path: {0}'
>                                 .format(path))
E                   OSError: Passed non-file path: s3://pandas-test/parquet_dir

../../../.conda/envs/pandas-dev/lib/python3.7/site-packages/pyarrow/parquet.py:1229: OSError

To reproduce, see the test case test_s3_roundtrip_for_dir I wrote below.

Member

Ah, OK. I see now in pyarrow that apparently string URIs with "s3://..." are not supported (while "hdfs://" is supported). That's something we should fix on the pyarrow side as well. But of course until then this is fine.
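Until pyarrow handles "s3://..." URIs natively, pandas has to detect the scheme itself and hand pyarrow an explicit filesystem. The detection step can be sketched with just the standard library (infer_scheme is a hypothetical name here, not the actual pandas helper):

```python
from urllib.parse import urlparse


def infer_scheme(filepath: str) -> str:
    # "s3://bucket/key" -> "s3"; plain local paths yield ""
    return urlparse(filepath).scheme


infer_scheme("s3://pandas-test/parquet_dir")  # "s3"
infer_scheme("/tmp/data.parquet")             # ""
```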

@jreback (Contributor) commented Apr 21, 2020

can you rebase

@@ -92,8 +97,7 @@ def write(
**kwargs,
):
self.validate_dataframe(df)
-        path, _, _, _ = get_filepath_or_buffer(path, mode="wb")
-
+        file_obj, _, _, _ = get_filepath_or_buffer(path, mode="wb")
Member Author

@jorisvandenbossche I think we can clean up the write method here to get rid of get_filepath_or_buffer, similar to what I've done below for read. Will address in a different PR.

@alimcmaster1 (Member Author)

> can you rebase

Sure, merged + green.

@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Apr 22, 2020
@@ -92,7 +97,7 @@ def write(
**kwargs,
):
self.validate_dataframe(df)
-        path, _, _, should_close = get_filepath_or_buffer(path, mode="wb")
+        file_obj, _, _, should_close = get_filepath_or_buffer(path, mode="wb")
Member

you didn't change path to file_obj in the if partition_cols is not None: block. Was that on purpose?

Member Author

That was indeed on purpose: write_to_dataset doesn't support a file-like object when the path is a directory.

import pyarrow.parquet
import pandas as pd
from pandas.io.common import get_filepath_or_buffer

path = "s3://pandas-test/dev"
file_obj, _, _, _ = get_filepath_or_buffer(path)
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
table = pyarrow.Table.from_pandas(df)

# Works
pyarrow.parquet.write_to_dataset(table, path, partition_cols=["a"])

# Throws AttributeError: 'NoneType' object has no attribute '_isfilestore'
pyarrow.parquet.write_to_dataset(table, file_obj, partition_cols=["a"])
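The rule can be sketched like this (choose_write_target is a hypothetical name, purely to illustrate why the partition_cols branch keeps the raw path):

```python
def choose_write_target(path, file_obj_or_path, partition_cols):
    if partition_cols is not None:
        # write_to_dataset creates a directory tree of partition files,
        # so it needs the raw path string, not an opened file-like object
        return path
    # single-file writes can go through the opened object
    return file_obj_or_path


choose_write_target("s3://pandas-test/dev", "FILE_OBJ", ["a"])  # -> the path
```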

@jorisvandenbossche (Member)

OK, can you add some clarifying comments for it?

Member Author

Sure done!

Contributor

is this always a file_obj, never a path? e.g. should we rename to filepath_or_buffer?

Member Author

Good point, I've renamed it to file_obj_or_path, since when a local path is passed in, a path str is returned.

Add clarifying comment
@pep8speaks

pep8speaks commented Apr 25, 2020

Hello @alimcmaster1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-26 20:58:49 UTC

@@ -150,6 +150,23 @@ def urlopen(*args, **kwargs):
return urllib.request.urlopen(*args, **kwargs)


def get_fs_for_path(filepath):
Contributor

can you type this (and the return annotation)

Member Author

Left the return type off for now, since it includes optional dependencies,

e.g. Union[s3fs.S3FileSystem, gcsfs.GCSFileSystem, None]

Can add imports to the TYPE_CHECKING block at the top if that's appropriate?

def get_fs_for_path(filepath):
"""
Get appropriate filesystem given a filepath.
Support s3fs, gcs and local disk fs
Contributor

can you make this a full doc-string (Parameters / Returns)

Member Author

Sure done :)
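Putting the typing and docstring suggestions together, the helper could look roughly like this; a sketch that assumes scheme sniffing via urllib.parse, not the exact pandas implementation. The optional imports stay inside the branches so neither s3fs nor gcsfs is required up front, and the TYPE_CHECKING block makes the string annotation resolvable for type checkers:

```python
from typing import TYPE_CHECKING, Optional, Union
from urllib.parse import urlparse

if TYPE_CHECKING:
    import gcsfs
    import s3fs


def get_fs_for_path(
    filepath: str,
) -> "Optional[Union[s3fs.S3FileSystem, gcsfs.GCSFileSystem]]":
    """
    Get the appropriate filesystem for a given filepath.

    Supports s3fs, gcsfs and local disk.

    Parameters
    ----------
    filepath : str
        File path, e.g. "s3://bucket/key" or a local path.

    Returns
    -------
    s3fs.S3FileSystem, gcsfs.GCSFileSystem or None
        A filesystem object for s3/gcs paths, None for local disk.
    """
    scheme = urlparse(filepath).scheme
    if scheme == "s3":
        import s3fs

        return s3fs.S3FileSystem()
    if scheme in ("gcs", "gs"):
        import gcsfs

        return gcsfs.GCSFileSystem()
    return None
```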


@jreback (Contributor) left a comment

doc-string comment + need to merge master

@@ -585,6 +585,8 @@ I/O
unsupported HDF file (:issue:`9539`)
- Bug in :meth:`~DataFrame.to_parquet` was not raising ``PermissionError`` when writing to a private s3 bucket with invalid creds. (:issue:`27679`)
- Bug in :meth:`~DataFrame.to_csv` was silently failing when writing to an invalid s3 bucket. (:issue:`32486`)
- :func:`read_parquet` now supports an s3 directory (:issue:`26388`)
Contributor

can you review the doc-strings to see if they need updating (e.g. may need a versionadded tag)

Member Author

The parquet docstrings indicate we already supported this, I think? I updated the whatsnew and added an example in the docstrings.

@jreback jreback merged commit 22cf0f5 into pandas-dev:master Apr 26, 2020
@jreback (Contributor) commented Apr 26, 2020

thanks @alimcmaster1

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request May 14, 2020
simonjayhawkins added a commit that referenced this pull request May 14, 2020
@simonjayhawkins simonjayhawkins modified the milestones: 1.1, 1.0.4 May 26, 2020
if should_close:
path.close()

parquet_ds = self.api.parquet.ParquetDataset(


@alimcmaster1

This change breaks clients that pass a file-like object for path. ParquetDataset doesn't provide the same file-like object handling that the original get_filepath_or_buffer did.

Here's the call stack I'm seeing:

.tox/test/lib/python3.7/site-packages/pandas/io/parquet.py:315: in read_parquet
    return impl.read(path, columns=columns, **kwargs)
.tox/test/lib/python3.7/site-packages/pandas/io/parquet.py:131: in read
    path, filesystem=get_fs_for_path(path), **kwargs
.tox/test/lib/python3.7/site-packages/pyarrow/parquet.py:1162: in __init__
    self.paths = _parse_uri(path_or_paths)
.tox/test/lib/python3.7/site-packages/pyarrow/parquet.py:47: in _parse_uri
    path = _stringify_path(path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _


I filed bug report #34467
